%matplotlib inline
Let an RNA strand be modeled as $T$ a string, with alphabet $A = \{ A, U, C, G\}$, such that $T_i$ is the $i$-th letter and let $|T| = N$ be the size of $S$. Two nucleic acid polymers of a single RNA string can undergo basepairing interactions through hydrogen bonds (A with U and C with G) These interactions lead the RNA single stranded string to form intricate 2D and 3D structures, and often the secondary and tertiary structures allow the RNA strand to achieve a specific function (its phenotype). Let the RNA secondary structure be denoted as $S$.
Different algorithms have been proposed so far, to find a set of pairs $S$ that maximizes the free energy of the RNA string (objective function). In this work we will use the well-known Vienna algorithm to infer the secondary structure of an RNA string. Moreover, the similarity between two strings is assessed here using distance measures such as the Levenshtein distance.
A neutral network is a set of genes all that have equivalent function or fitness.[1] Each node represents a gene sequence and each line represents the mutation connecting two sequences. Neutral networks can be thought of as high, flat plateaus in a fitness landscape. During neutral evolution, genes can randomly move through neutral networks and traverse regions of sequence space which may have consequences for robustness and evolvability.
Interestingly, two RNA strings that are very similar (related by point mutations) are likely to have equivalent functions (secondary structures/MFE) and thus replacing one string by the other would not imply a penalty in terms of fitness: the system is robust to point mutations. Let us model be the set of RNA strings as $V$ a set of nodes, and let us consider that $E$ is a set of edges, such that $E \subseteq V^2$, and for all $e = {v_i, v_j}\in E$, $dist(v_i,v_j)=1)$ where $dist$ denotes the Levenshtein distance between strings $v_i$ and $v_j$, then $G=\rangle V,E\langle$ is a graph of "genotypes" (or RNA strings) and an induced subgraph containing all the nodes leading to the same phenotype would be a neutral network). In this practical work we will study the importance of neutral networks) on living organisms' evolvability, taking RNAs and their secondary structures as case study, as in Wagner 2007.
Questions:
Let us assume that each nucleotide has the same probability to be chosen. The function should also receive the probabilities to undergo each kind of mutation.
import RNA
) to compute the secondary structures of the sequences loaded in the previous section.fc = RNA.fold_compound(seq)
function to create an RNA fold object for sequence seq
fc.mfe()
method to find the secondary structure that optimizes the Minimal Free Energy (MFE). This methods returns a tuple containing the structure the MFE itself.The secondary structure is represented as a string, therefore the Levenshtein distance can be used to compare the secondary structures of the RNA strings as in the previous section.
NetworkX
graph, and make another function to plot it (hint: use the nx.kamada_kawai_layout
function to define a suitable layout). Are such graphs planar?Write a function that explores the RNA sequence space as well as the corresponding secondary structure space, starting from a single ancestral sequence $T$. The exploration should run for a given number of iterations (set as a parameter), and at each iteration:
Pick one sequence from the miRNA database, explore only the chosen string's neighborhood:
motifs
submodule of BioPython
)import RNA
import numpy as np
from tqdm.notebook import tqdm
import pandas as pd
from Levenshtein import distance
import matplotlib.pyplot as plt
import networkx as nx
import seaborn as sns
def generate_random_RNA_sequence(length=20, alphabet=["a","u","c","g"]):
seq = list(np.random.choice(size=length,a=alphabet,replace=True))
return "".join(seq)
def mutate(sequence,p=np.asarray([0,0,1])):
alphabet = ["A","C","G","U"]
mutations = ["insertion","deletion","substitution"]
p = p/p.sum()
mutation = np.random.choice(a=mutations,size=1,p=p)[0]
location = np.random.randint(low=0,high=len(sequence),size=1)[0]
#print(location,mutation)
if mutation == "deletion":
return sequence[:location]+sequence[location+1:]
elif mutation == "insertion":
return sequence[:location]+np.random.choice(a=alphabet,size=1)[0]+sequence[location:]
elif mutation == "substitution":
alphabet_new = list(set(alphabet).difference([sequence[location]]))
return sequence[:location]+np.random.choice(a=alphabet_new,size=1)[0]+sequence[location+1:]
def rnasecstruc_2_graph(rna,struct):
nodes = list(rna)
G = nx.Graph()
for i in range(1,len(rna)):
G.add_edge(rna[i-1]+"|"+str(i-1),rna[i]+"|"+str(i))
open_link = []
for i in range(len(rna)):
if struct[i] == "(":
open_link.append(rna[i]+"|"+str(i))
elif struct[i] == ")":
last_open = open_link.pop()
G.add_edge(last_open,rna[i]+"|"+str(i))
else:
pass
return G
def plot_secondary_structure(G):
pos = nx.kamada_kawai_layout(G)
nx.draw(G,pos=pos,with_labels=True,font_size=10,node_size=70,alpha=0.5)
def explore_space(ancestor_rna,generations,allow_neutral_network=True):
fc = RNA.fold_compound(ancestor_rna)
ancestor_mfe_struct, ancestor_mfe = fc.mfe()
neutral_network = {}
neutral_network["struct"] = [ancestor_mfe_struct]
neutral_network["rna"] = [ancestor_rna]
neutral_network["rna_distance"] = [0]
neutral_network["struct_distance"] = [0]
neutral_network["mfes"] = [ancestor_mfe]
neutral_network["generations"] = [-1]
neighborhood = {}
neighborhood["struct"] = []
neighborhood["rna"] = []
neighborhood["rna_distance"] = []
neighborhood["struct_distance"] = []
neighborhood["mfes"] = []
neighborhood["generations"] = []
rna = ancestor_rna
for i in tqdm(range(generations)):
rna_ = mutate(rna)
fc = RNA.fold_compound(rna_)
(mfe_struct, mfe) = fc.mfe()
phenotype_distance = distance(ancestor_mfe_struct, mfe_struct)
genotype_distance = distance(ancestor_rna, rna_)
if phenotype_distance < 1 :
neutral_network["rna"].append(rna_)
neutral_network["struct"].append(mfe_struct)
neutral_network["rna_distance"].append(genotype_distance)
neutral_network["struct_distance"].append(phenotype_distance)
neutral_network["mfes"].append(mfe)
neutral_network["generations"].append(i)
if allow_neutral_network:
rna=rna_
if phenotype_distance >= 1:
neighborhood["rna"].append(rna_)
neighborhood["struct"].append(mfe_struct)
neighborhood["rna_distance"].append(genotype_distance)
neighborhood["struct_distance"].append(phenotype_distance)
neighborhood["mfes"].append(mfe)
neighborhood["generations"].append(i)
neutral_network = pd.DataFrame(neutral_network)
neighborhood = pd.DataFrame(neighborhood)
return(neutral_network,neighborhood)
ancestor_rna = "ACACCUGGGCUCUCCGGGUACC"
neutral_network,neighborhood = explore_space(ancestor_rna,10000,allow_neutral_network=True)
run = pd.concat([neutral_network,neighborhood])
run.index = run["generations"]
#g = sns.jointplot(
# x=run["rna_distance"],
# y=run["struct_distance"],
# kind="kde",
#)
#g.set_axis_labels('Sec. struc. Levenshtein Distance', 'Nucleotide Levenshtein Distance', fontsize=16)
plt.plot(run["rna_distance"],run["struct_distance"],".",alpha=0.05)
plt.plot(run["struct_distance"],".",alpha=0.1)
plt.plot(run["mfes"],".",alpha=0.1)
plot_secondary_structure(rnasecstruc_2_graph(run.loc[100,"rna"], run.loc[100,"struct"]))
plot_secondary_structure(rnasecstruc_2_graph(run.loc[0,"rna"], run.loc[0,"struct"]))
weird = run["struct_distance"].idxmax()
plot_secondary_structure(rnasecstruc_2_graph(run.loc[weird,"rna"], run.loc[weird,"struct"]))
weird = run["struct_distance"].idxmax()
plot_secondary_structure(rnasecstruc_2_graph(run.loc[weird-1,"rna"], run.loc[weird-1,"struct"]))
ancestor_rna = "ACACCUGGGCUCUCCGGGUACC"
neutral_network_from_single,neighborhood_from_single = explore_space(ancestor_rna,
100000,
allow_neutral_network=False)
neutral_network_from_single.shape
neighborhood_from_single.shape
plt.hist(neighborhood_from_single["mfes"],alpha=0.3,log=True)
plt.hist(neutral_network_from_single["mfes"],alpha=0.3,log=True)
plt.hist(neighborhood_from_single["rna_distance"],alpha=0.3,log=True)
plt.hist(neutral_network_from_single["rna_distance"],alpha=0.3,log=True)
plt.hist(neighborhood_from_single["struct_distance"],alpha=0.3,log=True)
neighborhood_from_single.groupby("struct").count()["generations"].sort_values()
from Bio.Seq import Seq
from Bio import motifs
instances = [Seq("T".join(rna.split("U"))) for rna in neutral_network_from_single["rna"]]
m = motifs.create(instances)
sns.heatmap(pd.DataFrame(m.counts).T)
#m.weblogo("neighborhood_single.png")
from Bio.Seq import Seq
from Bio import motifs
instances = [Seq("T".join(rna.split("U"))) for rna in neighborhood_from_single["rna"]]
m = motifs.create(instances)
sns.heatmap(pd.DataFrame(m.counts).T)
#m.weblogo("neighborhood_single.png")
ancestor_rna = "ACACCUGGGCUCUCCGGGUACC"
neutral_network,neighborhood = explore_space(ancestor_rna,100000,allow_neutral_network=True)
neutral_network.shape
neighborhood.shape
plt.hist(neighborhood["mfes"],alpha=0.3,log=True)
plt.hist(neutral_network["mfes"],alpha=0.3,log=True)
plt.hist(neighborhood["rna_distance"],alpha=0.3,log=True)
plt.hist(neutral_network["rna_distance"],alpha=0.3,log=True)
plt.hist(neighborhood["struct_distance"],alpha=0.3,log=True)
neighborhood.groupby("struct").count()["mfes"].sort_values()
_=plt.hist(neighborhood.groupby("struct").count()["mfes"].sort_values(),log=True, bins=50)
from Bio.Seq import Seq
from Bio import motifs
instances = [Seq("T".join(rna.split("U"))) for rna in neighborhood["rna"]]
m = motifs.create(instances)
sns.heatmap(pd.DataFrame(m.counts).T)
#m.weblogo("neighborhood_single.png")
from Bio.Seq import Seq
from Bio import motifs
instances = [Seq("T".join(rna.split("U"))) for rna in neutral_network["rna"]]
m = motifs.create(instances)
sns.heatmap(pd.DataFrame(m.counts).T)
#m.weblogo("neutral_network_single.png")